• The paper "E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding" introduces a new benchmark for evaluating video-language models. The authors argue that existing benchmarks focus primarily on video-level question answering and therefore fail to assess how well models understand fine-grained events within a video.
• E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) addresses this gap with a three-level task taxonomy covering 7.3K samples across 12 distinct tasks, built on 7K videos totaling 251.4 hours from eight domains. This breadth enables a more nuanced evaluation of whether models can ground specific events of interest within video content (a hypothetical sketch of the resulting evaluation loop follows this summary).
• Using the benchmark, the authors evaluate eight Image-LLMs and twelve Video-LLMs. Even state-of-the-art models that perform well on coarse, video-level understanding struggle with the fine-grained tasks in E.T. Bench. The identified causes include short video context lengths, inadequate time representations, and a lack of training data covering multiple events.
• To address these issues, the authors propose a baseline model, E.T. Chat, together with a tailored instruction-tuning dataset, E.T. Instruct, containing 164K samples aimed at fine-grained event-level understanding. These proposals improve performance across a range of scenarios, indicating their effectiveness in advancing event-level capabilities of video-language models.
• The paper also provides task definitions, the benchmark-generation pipeline, benchmark statistics, comparisons with existing video-LLM benchmarks, and visualizations of evaluation results.
• Overall, E.T. Bench marks a substantial step forward in evaluating video-language models, offering a more detailed and diverse framework for assessing their understanding of complex events in video. The authors invite the community to use the benchmark and cite their work to support further research in this area.
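To make the evaluation flow concrete, below is a minimal Python sketch of how one might iterate over event-level annotations and report per-task scores. The file path, field names ("video", "task", "question", "answer"), and the exact-match scoring are illustrative assumptions only; the released benchmark defines its own annotation format and task-specific metrics.

```python
# Hypothetical sketch of scoring a model on E.T. Bench-style annotations.
# All field names and the scoring rule are assumptions for illustration,
# not the benchmark's actual format or metrics.
import json
from collections import defaultdict

def load_samples(path: str):
    """Load a JSON list of annotation dicts (hypothetical layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def evaluate(samples, predict):
    """Score predictions per task with simple exact-match accuracy
    (a stand-in; the paper uses task-specific metrics)."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        task = sample["task"]  # e.g. an event-level grounding task
        pred = predict(sample["video"], sample["question"])
        total[task] += 1
        if pred.strip().lower() == sample["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}

if __name__ == "__main__":
    samples = load_samples("et_bench_annotations.json")  # hypothetical path
    scores = evaluate(samples, predict=lambda video, question: "placeholder")
    print(scores)
```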